Abstract

We present a formal verification of the transient fault 
recovery aspects of the Reliable Computing Platform 
(RCP), a fault-tolerant computing system architec- 
ture for digital flight control applications. The RCP 
uses NMR-style redundancy to mask faults and inter- 
nal majority voting to purge the effects of transient 
faults. The system design has been formally speci- 
fied and verified using the Ehdm verification system. 
Our formalization accommodates a wide variety of 
voting schemes for purging the effects of transients. 

Key Words: Correctness proofs, fault tolerance,
formal methods, majority voting, modular redundancy,
theorem proving, transient fault recovery.

1 Introduction 

NASA Langley Research Center (LaRC) is explor- 
ing formal verification as a candidate technology for 
the elimination of design errors in digital fly-by-wire
control systems. In previous reports [1, 2], we put
forward a high-level architecture for a reliable
computing platform (RCP) based on fault-tolerant
computing principles. Central to this work is the formal
verification of a fault-tolerant operating system
that schedules and executes the application tasks of 
a flight control system. RCP is designed to auto- 
matically purge the effects of transients periodically. 
Emphasis has been placed on techniques that mathe- 
matically show when the desired recovery properties 
are obtained. Moreover, specifications and proofs 
have been mechanized using the Ehdm verification 
system [5]. 

RCP contains a well-defined operating system that
provides the applications software developer with a
reliable mechanism for dispatching periodic tasks on a
fault-tolerant computing base that appears to the
developer as a single ultra-reliable processor. A
four-level hierarchical decomposition of RCP has been
performed. The top level of the hierarchy describes the
operating system as a function that sequentially invokes
application tasks. This view of the operating system will
be referred to as the uniprocessor model, which forms
the top-level requirement for the RCP.

Fault tolerance is achieved by voting the results
computed by the replicated processors operating on
identical inputs. Interactive consistency checks on
sensor inputs and voting of actuator outputs require
synchronization of the replicated processors. The
second level in the hierarchy describes the operating
system as a frame-synchronous system where each
replicated processor executes the same application
tasks. The existence of a global time base, an
interactive consistency mechanism, and a reliable voting
mechanism is assumed at this level.

Level 3 of the hierarchy breaks a frame into four 
sequential phases. This allows a more explicit mod- 
eling of interprocessor communication and the time 
phasing of computation, communication, and voting. 
The use of this intermediate model avoids introduc- 
ing these issues along with those of real time, thus 
preventing an overload of details in the proof process. 

At the fourth level, the assumptions of the syn- 
chronous model must be discharged. Clock synchro- 
nization algorithms can serve as a foundation for the 
implementation of the replicated system as a collec- 
tion of asynchronously operating processors. Ded- 
icated hardware implementations of the clock syn- 
chronization function are a long-term goal. 

Figure 1 depicts the generic hardware architec- 
ture assumed for implementing the replicated sys- 
tem. Single-source sensor inputs are distributed 
by special purpose hardware executing a Byzantine 
agreement algorithm. Replicated actuator outputs 
are all delivered in parallel to the actuators, where 
force-sum voting occurs. Interprocessor communi- 
cation links allow replicated processors to exchange 
and vote on the results of task computations. 

Figure 1: Generic hardware architecture.

2 Modeling approach 

The specification of the Reliable Computing Platform
(RCP) is based on state machine concepts. A system
state models the memory contents of all processors as
well as auxiliary variables such as the fault status of
each processor. This latter type of information may not
be observable by a running system, but provides a way
to express precise specifications. System behavior is
described by specifying an initial state and the
allowable transitions from one state to another.

The RCP specification consists of four separate models
of the system: Uniprocessor System (US), Replicated
Synchronous (RS), Distributed Synchronous (DS), and
Distributed Asynchronous (DA). These models correspond
to the four design layers outlined in the introduction.
We focus on the US and RS layers in this paper.

The proof method is a variation of the classical
algebraic technique of showing that a homomorphism
exists. Such a proof can be visualized as showing that
a diagram "commutes" (figure 2). Consider two adjacent
levels of abstraction, called the top and bottom levels
for convenience. At the top level we have a current
state, s', a destination state, t', and a transition
that relates the two. The properties of the transition
are given as a mathematical relation,
$\mathcal{N}_{top}(s', t')$. Similarly, the bottom
level consists of states, s and t, and a transition
that relates the two, $\mathcal{N}_{bottom}(s, t)$. The
state values at the bottom level are related to the
state values at the top level by way of a mapping
function, map. To establish that the bottom level
implements the top level one must show that the diagram
commutes (in a sense meant for relations):

$$\mathcal{N}_{bottom}(s, t) \supset \mathcal{N}_{top}(map(s), map(t))$$

where map(s) = s' and map(t) = t' in the diagram. One
must also show that initial states map up:

$$\mathcal{I}_{bottom}(s) \supset \mathcal{I}_{top}(map(s))$$

Figure 2: States, transitions, and mappings.
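
The two proof obligations of this section (bottom-level
transitions map to top-level transitions, and initial
states map up) can be illustrated by a brute-force check
on a small finite state machine. The sketch below is
only an illustration: the toy machines (a modulo-4
counter at the bottom level, a parity bit at the top
level) and every function name are assumptions made for
this example and are not part of the RCP specification.
The mechanized proofs establish these conditions
deductively, not by enumeration.

```python
from itertools import product


def commutes(bottom_states, n_bottom, n_top, mapping):
    """True if every bottom-level transition maps to a top-level transition."""
    return all(
        n_top(mapping(s), mapping(t))
        for s, t in product(bottom_states, repeat=2)
        if n_bottom(s, t)
    )


def initial_maps(bottom_states, i_bottom, i_top, mapping):
    """True if every bottom-level initial state maps to a top-level initial state."""
    return all(i_top(mapping(s)) for s in bottom_states if i_bottom(s))


# Hypothetical toy instance: the bottom level counts modulo 4,
# the top level only tracks the parity of the count.
BOTTOM_STATES = range(4)


def n_bottom(s, t):
    return t == (s + 1) % 4


def n_top(a, b):
    return b == (a + 1) % 2


def parity(s):
    return s % 2


def i_bottom(s):
    return s == 0


def i_top(a):
    return a == 0


print(commutes(BOTTOM_STATES, n_bottom, n_top, parity))
print(initial_maps(BOTTOM_STATES, i_bottom, i_top, parity))
```

Replacing parity with a mapping that does not respect
the transitions (for example s % 3) makes the first check
fail, which is the finite analogue of a diagram that does
not commute.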


3 Design specifications 

The US specification is very simple: 

$$\mathcal{N}_{us} : \text{function}[\text{Pstate}, \text{Pstate}, \text{inputs} \to \text{bool}] = (\lambda\, s, t, u : t = f_c(u, s))$$

The function $\mathcal{N}_{us}$ defines the transition
relation between the current state s and the next state
t with sensor input u. We require that the computation
performed by the uniprocessor system is deterministic
and can be modeled by a function
$f_c : \text{inputs} \times \text{Pstate} \to \text{Pstate}$.
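
A minimal Python sketch of the US transition relation
follows. It shows that the relation simply tests whether
t is the unique successor computed from s and u. The
concrete computation function used here (a running sum
of sensor inputs) is a hypothetical stand-in, since the
computation function is left uninterpreted in the
specification.

```python
def make_n_us(f_c):
    """Build the US transition relation from a computation function f_c(u, s)."""
    def n_us(s, t, u):
        # The relation holds exactly when t is the state computed from s and u.
        return t == f_c(u, s)
    return n_us


def running_sum(u, s):
    """Hypothetical computation function: accumulate the sensor input."""
    return s + u


n_us = make_n_us(running_sum)

print(n_us(3, 5, 2))   # is 5 the successor of state 3 under input 2?
print(n_us(3, 6, 2))   # is 6 the successor of state 3 under input 2?
```

Because the computation is a function, each pair of
current state and input admits exactly one next state.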

At the RS layer of design, the state is replicated 
and a postprocessing step is added after computa- 
tion. This step represents the voting of state vari- 
ables, which may be selectively applied. 

The state of a single processor is modeled by a
record named rs_proc_state. The first field of the
record is healthy, which is 0 when a processor is
faulty. Otherwise, it indicates the (unbounded) number
of state transitions since the last fault. A processor
that is recovering from a transient fault is indicated
by a nonzero value of healthy less than the constant
recovery_period. A processor is said to be working
whenever healthy $\geq$ recovery_period. The second
field of the record is the computation state of the
processor. It takes values from the same domain as used
in the US specification. The complete state at this
level, RSstate, is a vector (or array) of these records.
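
The processor record and the three status classes can be
sketched in Python as follows. The name of the second
field (proc_state), the numeric value chosen for the
recovery period, and the predicate names are assumptions
made for this illustration. The working predicate is
taken here as healthy >= recovery_period so that the
three classes are exhaustive and mutually exclusive.

```python
from dataclasses import dataclass
from typing import Any

RECOVERY_PERIOD = 3  # hypothetical value; the specification leaves it as a constant


@dataclass(frozen=True)
class RsProcState:
    healthy: int      # 0 if faulty, else transitions since the last fault
    proc_state: Any   # computation state (field name assumed for this sketch)


def is_faulty(p):
    return p.healthy == 0


def is_recovering(p):
    return 0 < p.healthy < RECOVERY_PERIOD


def is_working(p):
    return p.healthy >= RECOVERY_PERIOD


# An RSstate is a vector of these records, one per replicated processor.
rs_state = [RsProcState(5, "x"), RsProcState(1, "x"), RsProcState(0, "?")]
print([(is_working(p), is_recovering(p), is_faulty(p)) for p in rs_state])
```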

3.1 Transition relation

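The RS transition relation, $\mathcal{N}_{rs}$, is
conditioned on the nonfaulty status of each processor:

$$
\begin{aligned}
\mathcal{N}_{rs} &: \text{function}[\text{RSstate}, \text{RSstate}, \text{inputs} \to \text{bool}] = \\
&(\lambda\, s, t, u : (\exists\, h : (\forall\, i : \\
&\qquad s(i).\mathit{healthy} > 0 \supset \\
&\qquad\quad \mathit{good\_values\_sent}(s, u, h(i)) \land \mathit{voted\_final\_state}(s, t, u, h, i))) \\
&\quad \land\ \mathit{allowable\_faults}(s, t))
\end{aligned}
$$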

This relation is defined in terms of three subfunctions:
good_values_sent, voted_final_state, and
allowable_faults. The first aspect of this definition to
note is that the relation holds only when
allowable_faults is true. This corresponds to the
"Maximum Fault Assumption" discussed in [1], namely
that all reachable states must have a majority of
working processors. The next thing to notice is that
the transition relation is defined in terms of a
conjunction good_values_sent(s, u, h(i)) $\land$
voted_final_state(s, t, u, h, i). The meaning is that
the outputs produced by the good processors are
contained in the mailbox vector h, and the final state t
is obtained by voting the h values.

Two uninterpreted functions are assumed to ex- 
press specifications that involve selective voting on 
portions of the computation state. 

$$f_s : \text{function}[\text{Pstate} \to \text{MB}]$$

$$f_v : \text{function}[\text{Pstate}, \text{MBvec} \to \text{Pstate}]$$

These two functions split up the selective voting
process to mirror what happens in the RCP architecture.
First, $f_s$ is used to select a subset of the state
components to be voted during the current frame. The
choice of which components to vote is assumed to depend
on the computation state. It maps into the type MB,
which stands for a mailbox item. Second, the function
$f_v$ takes the current state value and overwrites
selected portions of it with voted values derived from a
vector of mailbox items.
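
One possible concrete reading of the two selective voting
functions is sketched below in Python. Everything concrete
here is an assumption made for illustration: the
representation of a Pstate as a dictionary with a frame
counter and named cells, the round-robin choice of one
cell per frame, and the strict-majority vote. The
specification itself leaves both functions uninterpreted.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority of `values`, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def f_s(pstate):
    """Select the cells to vote this frame (assumed: one cell, round robin)."""
    names = sorted(pstate["cells"])
    chosen = names[pstate["frame"] % len(names)]
    return {chosen: pstate["cells"][chosen]}


def f_v(pstate, mb_vec):
    """Overwrite the voted cells of `pstate` with majority values from `mb_vec`."""
    cells = dict(pstate["cells"])
    voted_names = {name for mb in mb_vec for name in mb}
    for name in voted_names:
        winner = majority([mb.get(name) for mb in mb_vec])
        if winner is not None:
            cells[name] = winner
    return {"frame": pstate["frame"], "cells": cells}


good = {"frame": 0, "cells": {"a": 1, "b": 2}}
bad = {"frame": 0, "cells": {"a": 99, "b": 7}}  # corrupted by a transient fault
mb_vec = [f_s(good), f_s(good), f_s(bad)]       # one mailbox item per processor
recovered = f_v(bad, mb_vec)
print(recovered)
```

In this example only cell a is voted in frame 0, so the
recovering processor regains the correct value of a while
cell b stays corrupted until a later frame votes it. This
is the gradual, cell-by-cell recovery that selective
voting implies.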

3.2 Generic fault tolerance 

To model a very general class of transient fault re- 
covery schemes, we seek to parameterize the spec- 
ifications as much as possible. This parameteriza- 
tion takes the form of a set of uninterpreted con- 
stants, types, and functions along with axioms to 
constrain their values. Using this method, a wide 
variety of voting patterns are covered by the model, 
from highly frequent voting to minimally frequent 
voting. 


We assume the state contains a control portion,
used to schedule and manage computation, and a
vector of cells, each individually accessible and
holding application-specific state information. Also
assumed is the existence of access functions to extract
and manipulate these items from a Pstate value.

For every application-specific transient fault recov- 
ery scheme to be used with RCP, we must be able 
to determine when individual state components have 
been recovered. This condition is expressed in terms 
of the current control state and the number of non- 
faulty frames since the last transient fault. Two un- 
interpreted functions are provided for this purpose. 

$$\mathit{rec} : \text{function}[\text{cell}, \text{control\_state}, \text{nat} \to \text{bool}]$$

The predicate rec(c, K, H) is true iff cell c's state
should have been recovered when in control state K
with healthy frame count H.

$$\mathit{dep} : \text{function}[\text{cell}, \text{cell}, \text{control\_state} \to \text{bool}]$$

The predicate dep(c, d, K) indicates that cell c's
value in the next state depends on cell d's value in
the current state, when in control state K. This
notion of dependency is different from the notion of
computational dependency; it determines which cells
need to be recovered in the current frame on the
recovering processor for cell c's value to be considered
recovered at the end of the current frame.
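
To make the roles of rec and dep concrete, the following
Python sketch instantiates them for one simple scheme in
which every cell is voted in every frame. The scheme, the
threshold value, and the helper recovered_cells are all
assumptions made for this illustration. No claim is made
that these definitions satisfy the axioms of the theory;
for a real application that would have to be proved.

```python
FIRST_RECOVERED_COUNT = 2  # hypothetical threshold on the healthy frame count


def rec(cell, control_state, healthy_count):
    """Assumed scheme: all cells are voted every frame, so whether a cell
    should be recovered depends only on the healthy frame count."""
    return healthy_count >= FIRST_RECOVERED_COUNT


def dep(cell, other_cell, control_state):
    """Assumed scheme: a voted cell needs no other cell to be recovered first."""
    return False


def recovered_cells(cells, control_state, healthy_count):
    """Cells that should have been recovered in the given situation."""
    return {c for c in cells if rec(c, control_state, healthy_count)}


cells = {"a", "b", "c"}
print(recovered_cells(cells, control_state=0, healthy_count=1))
print(recovered_cells(cells, control_state=0, healthy_count=2))
```

A scheme that votes less often would instead make rec
depend on the cell and the control state, and would use
dep to record which cells must already be recovered
before a computed (unvoted) cell can count as recovered.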

Having postulated several functions that characterize
a generic fault-tolerant computing application,
it is necessary to introduce axioms that sufficiently
constrain these functions. Eight axioms are provided 
in the theory for this purpose. Once concrete defi- 
nitions for the functions have been chosen, these ax- 
ioms must be proved to follow as theorems for the 
RCP results to hold for a given application. 

4 RS layer proof 

Proving that the RS state machine correctly im- 
plements the US state machine involves introduc- 
ing a mapping between states of the two machines. 
The function RSmap defines the required mapping, 
namely the majority of Pstate values over all the pro- 
cessors. 
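
A minimal Python sketch of such a mapping is given below.
It models an RS state as a list of (healthy, Pstate)
pairs and returns the Pstate held by a strict majority of
processors. The data representation, the function name
rs_map, and the choice to raise an error when no majority
exists are assumptions made for this illustration.

```python
from collections import Counter


def rs_map(rs_state):
    """Map a replicated state to the Pstate held by a majority of processors.

    `rs_state` is a sequence of (healthy, pstate) pairs; pstate values must
    be hashable. Raises ValueError if no strict majority exists.
    """
    pstates = [pstate for _healthy, pstate in rs_state]
    value, count = Counter(pstates).most_common(1)[0]
    if 2 * count <= len(pstates):
        raise ValueError("no majority of Pstate values")
    return value


# Three processors: two agree, one holds a corrupted state.
print(rs_map([(5, "x"), (4, "x"), (0, "garbage")]))
```

Under the Maximum Fault Assumption a majority of
processors is working in every reachable state, which is
the situation in which a majority value of this kind is
meaningful.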

The two theorems required to establish that RS 
implements US are the following. 

frame_commutes: Theorem

$$\mathit{reachable}(s) \land \mathcal{N}_{rs}(s, t, u) \supset \mathcal{N}_{us}(\mathit{RSmap}(s), \mathit{RSmap}(t), u)$$

initial_maps: Theorem

$$\mathit{initial\_rs}(s) \supset \mathit{initial\_us}(\mathit{RSmap}(s))$$

The theorem frame_commutes shows that a successive
pair of reachable RS states can be mapped by RSmap
into a successive pair of US states. The theorem
initial_maps shows that an initial RS state can be
mapped into an initial US state.

Proofs for the two main theorems are supported by 
a handful of lemmas. The most important is a state 
invariant that relates values of various state compo- 
nents to their corresponding consensus values. 

5 Implementation issues 

Recovery of state information following a transient 
fault occurs gradually, one cell at a time, possibly 
taking many frames to complete. Depending on the 
voting pattern used, some tasks will be executing in 
the presence of erroneous state information. Conse- 
quently, steps must be taken to prevent errors from 
propagating to already-recovered cells in a proces- 
sor’s state. 

Implicit in the RS specifications is that the com- 
putation of task outputs is not subject to interfer- 
ence by other tasks executing with erroneous data 
inputs. Nonetheless, in a real processor a program 
in execution can interfere with another’s data un- 
less hardware protection mechanisms are in place. In 
a similar manner, interference can be caused in the 
time domain as well as the data domain. Therefore, 
hardware protection features are required to prevent 
both kinds of interference in a system that attempts 
to recover state information selectively. 

There are several well-known hardware techniques 
for providing this type of protection. RCP imple- 
mentations will need to use memory write-protection 
mechanisms, watch-dog timers, and privileged op- 
erating modes to ensure that tasks cannot interfere 
with one another during the incremental process of 
recovering state information. 

6 Conclusion 

We have described a formalization of the transient 
fault recovery aspects of a reliable computing plat- 
form (RCP). The top two specification layers are 
quite abstract and should serve as a model for many 
fault-tolerant system designs. The specifications of
redundancy management and transient fault recovery are
based on a very general model of fault-tolerant
computing similar to one proposed by Rushby [3, 4]. A
wide spectrum of provable voting schemes is accom- 
modated, offering a rich solution space to the appli- 
cation designer. The Replicated Synchronous layer 
of specification has been completely proved to the
standards of rigor of the Ehdm mechanical proof system.